Welcome everybody to deep learning. So today we want to look into further common practices
and in particular in this video we want to discuss architecture selection and hyperparameter
optimization.
And you know, nothing in machine learning is exact, right?
However, the test data is still in the vault. We are not touching it. But we need to
set our hyperparameters somehow, and as you've already seen, there is an enormous number of
hyperparameters.
You have to select an architecture, the number of layers, the number of nodes per layer, the activation
functions, and then you have all the parameters of the optimization, the initialization, the
loss function, and many more.
The optimizers also have options like the type of gradient descent, momentum, learning
rate decay, and batch size; in regularization you have different regularizers, L1 and
L2 penalties, batch normalization, dropout, and so on.
You want to somehow figure out all the parameters for those different kinds of procedures.
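As a small illustration (not from the lecture), all of these choices could be collected into one configuration so that each experiment is fully specified and reproducible; every name and value below is an assumption made only for this sketch.

```python
# Minimal sketch of a hyperparameter configuration; all names and values
# are illustrative, not recommendations from the lecture.
config = {
    # architecture
    "num_layers": 4,
    "nodes_per_layer": 256,
    "activation": "relu",
    # optimization
    "optimizer": "sgd",        # type of gradient descent
    "momentum": 0.9,
    "learning_rate": 1e-2,
    "lr_decay": 0.95,          # multiplicative learning rate decay per epoch
    "batch_size": 64,
    # regularization
    "l2_penalty": 1e-4,
    "l1_penalty": 0.0,
    "batch_norm": True,
    "dropout_rate": 0.5,
}
```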
Now let's choose an architecture and a loss function.
The first step would be to think about the problem and the data. What could the features
look like?
What kind of spatial correlation do you expect?
What data augmentation makes sense?
How will the classes be distributed?
What is important regarding the target application?
Then you start with simple architectures and loss functions and of course you do your research.
Try well-known models first and foremost.
They have been published, and there are so many papers out there.
Hence, there is no need to do everything yourself.
One day in the library can save hours, weeks and months of experimentation.
Do the research, it will really save you time.
Very good papers often don't just present the scientific result; the authors also share
the code, and sometimes even the data.
Try to find those papers.
This can help you a lot with your experimentation.
So then you may want to change and adapt the architecture to your problem.
If you change something, find good reasons why this is an appropriate change.
There are quite a few papers out there that seem to introduce random changes into the
architecture.
Later, it turns out that the observations they made were essentially random; the authors
were just lucky or experimented long enough on their own data to get the improvements.
For a sound change, there is typically a reasonable argument for why it should give an
improvement in performance.
Next you want to do your hyperparameter search.
So remember the learning rate decay, regularization, dropout, and so on.
These all have to be tuned.
Keep in mind that the networks can take days or weeks to train, and you have to search over these
hyperparameters.
Hence, we recommend using a log scale.
So, for example, for the learning rate η you would try 0.1, 0.01, and 0.001.
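As a minimal sketch (assuming NumPy and an illustrative range of 10⁻⁴ to 10⁻¹ that is not taken from the lecture), log-spaced learning rate candidates could be generated like this:

```python
import numpy as np

# Log-spaced learning rate candidates: 1e-4, 1e-3, 1e-2, 1e-1
etas = np.logspace(-4, -1, num=4)

# For a random search, sample the exponent uniformly so that the
# learning rate itself is distributed log-uniformly.
rng = np.random.default_rng(seed=0)
eta = 10 ** rng.uniform(-4, -1)
```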
You may want to consider a grid search or a random search.
In a grid search you would use equally spaced steps, and if you look at reference 2, they
have shown that a random search has advantages over the grid search.
First of all, it is easier to implement, and second, it gives a better exploration of the
hyperparameter space.
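To make the difference concrete, here is a hedged sketch contrasting the two strategies for two hyperparameters; `train_and_validate` is a hypothetical function standing in for a full training and validation run and is not defined here.

```python
import itertools
import numpy as np

rng = np.random.default_rng(seed=0)

# Grid search: a fixed set of equally (log-)spaced values per hyperparameter.
learning_rates = [1e-4, 1e-3, 1e-2, 1e-1]
dropout_rates = [0.1, 0.3, 0.5]
grid_trials = list(itertools.product(learning_rates, dropout_rates))  # 12 runs

# Random search: draw each hyperparameter independently for every trial.
# With the same budget of 12 runs, each axis is probed at 12 distinct values
# instead of 4 (or 3), which explores the space better when only a few
# hyperparameters really matter.
random_trials = [(10 ** rng.uniform(-4, -1), rng.uniform(0.1, 0.5))
                 for _ in range(12)]

# Hypothetical evaluation on the validation set (never the test set):
# best_lr, best_dropout = max(random_trials,
#                             key=lambda p: train_and_validate(*p))
```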